Building an Evaluation Scale using Item Response Theory
نویسندگان
چکیده
Evaluation of NLP methods requires testing against a previously vetted gold-standard test set and reporting standard metrics (accuracy/precision/recall/F1). The current assumption is that all items in a given test set are equal with regards to difficulty and discriminating power. We propose Item Response Theory (IRT) from psychometrics as an alternative means for gold-standard test-set generation and NLP system evaluation. IRT is able to describe characteristics of individual items - their difficulty and discriminating power - and can account for these characteristics in its estimation of human intelligence or ability for an NLP task. In this paper, we demonstrate IRT by generating a gold-standard test set for Recognizing Textual Entailment. By collecting a large number of human responses and fitting our IRT model, we show that our IRT model compares NLP systems with the performance in a human population and is able to provide more insight into system performance than standard evaluation metrics. We show that a high accuracy score does not always imply a high IRT score, which depends on the item characteristics and the response pattern.
منابع مشابه
Psychometric Properties of State Level Subjective Vitality Scale based on classical test theory and Item-response theory
The purpose of the present study was to investigate the factor structure and Item-Response parameters of State Level of Subjective Vitality Scale. The research design was correlational, and the statistical population consisted of students of the Shahid Beheshti University of Tehran. Sample group including 240 students were selected through multi-stage sampling and completed Subjective Vitality ...
متن کاملPsychometric Properties of the Brief Form of Professor-Students Rapport Scale-based on Classical Test Theory and Item-Response Theory
Introduction: In order to improve the quality of the teaching process, it is necessary to review the professor-student rapport. The purpose of the present study was to investigate the factor structure and item-response parameters of Professor-Students Rapport Scale-Brief (PSRS-B). Methods: In a descriptive-correlation study, 497 students from Shahid Beheshti University of Medical Sciences were ...
متن کاملEvaluation Psychometric Characteristics of the Persian Version of the Colorado Learning Attitudes about Science Survey Using polytomous Item Response Model
Goal: Researchers in the field of science education believe that peoplechr(chr('39')39chr('39'))s attitudes about learning will have a significant impact on their future learning and what they learn from science will not be unrelated to their views and attitudes. Accordingly, most questionnaires have been developed to measure attitudes toward science, especially about physics learning attitudes...
متن کاملEvaluation of TLD Performance in Reducing the Seismic Response of Structures
Tuned Liquid Dampers (TLD) are among passive control devices that have been used to suppress the vibration of structures in recent years. These structures must be adequately presentable as an equivalent single degree of freedom system with long fundamental period. The TLD, located at the top floors of the structure, can dissipate the external input energy into the system through the sloshing ef...
متن کاملEvaluation of TLD Performance in Reducing the Seismic Response of Structures
Tuned Liquid Dampers (TLD) are among passive control devices that have been used to suppress the vibration of structures in recent years. These structures must be adequately presentable as an equivalent single degree of freedom system with long fundamental period. The TLD, located at the top floors of the structure, can dissipate the external input energy into the system through the sloshing ef...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing
دوره 2016 شماره
صفحات -
تاریخ انتشار 2016